Introduction

Regression analysis is a simple yet powerful method for uncovering relationships within a dataset. In this post, we will look at the insurance charges data obtained from Kaggle (https://www.kaggle.com/mirichoi0218/insurance/home). This dataset consists of 7 columns: age, sex, bmi, children, smoker, region and charges. We will get into more detail about these variables later.

The key questions that we would be asking are:

  1. Is there a relationship between medical charges and other variables in the dataset?
  2. How strong is the relationship between the medical charges and other variables?
  3. Which variables have a strong relation to medical charges?
  4. How accurately can we estimate the effect of each variable on medical charges?
  5. How accurately can we predict future medical charges?
  6. Is the relationship linear?
  7. Is there synergy among the predictors?

We start with importing the required libraries:

library(magrittr)
library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:magrittr':
## 
##     set_names
library(MASS)
library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
library(broom)
## Warning: package 'broom' was built under R version 3.4.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
library(psych)
## Warning: package 'psych' was built under R version 3.4.4
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following object is masked from 'package:car':
## 
##     logit
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

We import the data from the CSV file. We can see an overview of the data using the summary() function.

insurance <- read.csv('insurance.csv')
summary(insurance)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

The key points that can be taken from the summary are:

  1. The age of participants varies from 18 to 64.
  2. Around 49.48% of participants are female.
  3. The bmi of participants ranges from 15.96 to 53.13.
  4. Only 20.48% of the participants are smokers.

Is there a relationship between the medical charges and the predictors?

Linear regression follows the formula:

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

where β₀ is the intercept, βᵢ is the coefficient of predictor xᵢ, and ε is the irreducible error term.

Each coefficient in this linear equation denotes the magnitude of the additive relation between its predictor and the response, holding the other predictors fixed.

As such, the null hypothesis would be that there is no relation between any of the predictors and the response, which is the case when all the coefficients for the predictors are 0. The alternate hypothesis would be that at least one of the predictors has a relation with the outcome, that is, the coefficient of at least one predictor is non-zero.

This hypothesis is tested by computing the F-statistic. If there is no relationship between the predictors and the response, the F-statistic will be close to 1. On the contrary, if the alternate hypothesis is true, the F-statistic will be greater than 1. The p-value of the F-statistic can be calculated from the number of records (n) and the number of predictors (p), and can then be used to determine whether the null hypothesis can be rejected.
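The F-statistic can also be computed by hand from residual sums of squares, which makes the definition concrete. A minimal sketch in base R, assuming the insurance data frame has already been read in with read.csv():

```r
# Sketch: compute the overall F-statistic of a multiple regression by hand.
full <- lm(charges ~ ., data = insurance)   # model with all predictors
null <- lm(charges ~ 1, data = insurance)   # intercept-only model

n <- nrow(insurance)            # number of observations
p <- length(coef(full)) - 1     # number of fitted coefficients (dummy variables included)

rss0 <- sum(resid(null)^2)      # RSS of the null model (= total sum of squares)
rss1 <- sum(resid(full)^2)      # RSS of the full model

Fstat <- ((rss0 - rss1) / p) / (rss1 / (n - p - 1))
Fstat                                        # should match the F-statistic in summary(full)
pf(Fstat, p, n - p - 1, lower.tail = FALSE)  # p-value of the F-statistic
```

anova(null, full) performs the same comparison in a single call.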

We will start with fitting a multiple linear regression model using all the predictors:

lm.fit <- lm(formula = charges~., data = insurance)
#charges~. is the formula being used for linear regression. Here '.' means all the predictors in the dataset.
summary(lm.fit)
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

A high value of the F-statistic, with a significant p-value (< 2.2e-16), implies that the null hypothesis can be rejected: at least one of the predictors is related to the outcome.

RSE (Residual Standard Error) is the estimate of the standard deviation of the irreducible error. In simpler words, it is the average amount by which the actual outcomes deviate from the fitted regression line. Hence, a large value of RSE means a high deviation from the true regression line, so RSE is useful in assessing the lack of fit of the model to the data. The RSE of our model (6062) is sizeable relative to the mean charge (13270), indicating that the model doesn't fit the data very closely.

R-squared measures the proportion of variability in Y that can be explained by X, and is always between 0 and 1. A high value of R-squared (0.7509) shows that around 75% of the variance of the data is being explained by the model.
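Both quantities can be recovered by hand from lm.fit, which makes their definitions concrete. A sketch using only base R:

```r
# RSE and R-squared computed from first principles for the fitted model.
rss <- sum(resid(lm.fit)^2)                                  # residual sum of squares
tss <- sum((insurance$charges - mean(insurance$charges))^2)  # total sum of squares
n <- nrow(insurance)
p <- length(coef(lm.fit)) - 1

sqrt(rss / (n - p - 1))  # RSE: reported as 6062 in summary(lm.fit)
1 - rss / tss            # R-squared: reported as 0.7509 in summary(lm.fit)
```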

Which variables have a strong relation to medical charges?

If we look at the p-values of the estimated coefficients above, we see that not all the coefficients are statistically significant. This means that only a subset of the predictors is related to the outcome. The question is which ones.

Looking at individual p-values to select variables is reasonable here because the number of predictors (7) is small compared to the number of observations (1338). This approach breaks down when the number of predictors exceeds the number of observations; in such cases, we would have to use feature/variable selection methods, like forward selection, backward selection, or mixed selection. Before turning to those methods, let us try regression using only the features with significant p-values.

lm.fit.sel <- lm(charges~age+bmi+children+smoker+region, data = insurance)
summary(lm.fit.sel)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16

We will compare this to mixed variable selection, which is a combination of forward selection and backward selection.

step.lm.fit <- stepAIC(lm.fit, direction = "both", trace = FALSE)
summary(step.lm.fit)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16

The model given by stepwise selection is the same as the model we got by selecting predictors with significant p-values, so the simple method of selecting coefficients on the basis of p-values works in this case.

Compared to the full model, there is a very slight improvement in the adjusted R-squared (0.7494 -> 0.7496) and a very slight reduction in RSE (6062 -> 6060).

Some key inferences to be taken from the model are:

  1. Charges increase with the age of the key beneficiary. For every 1-year increase in the age of the key beneficiary, keeping everything else fixed, charges increase by around $256.97.
  2. Similar relations can be seen for other predictors. Higher charges are expected with higher BMI or higher number of children/dependents or if the person is a smoker.
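These coefficients translate directly into predictions via predict(). A sketch using a hypothetical beneficiary (all values below are invented for illustration):

```r
# Hypothetical new beneficiary -- values chosen purely for illustration.
new_person <- data.frame(age = 40, bmi = 30, children = 2,
                         smoker = "no", region = "southwest")
predict(step.lm.fit, newdata = new_person, interval = "prediction")

# Flipping only the smoker flag shifts the prediction by the
# smokeryes coefficient (~ $23,836), everything else held fixed.
new_smoker <- transform(new_person, smoker = "yes")
predict(step.lm.fit, newdata = new_smoker) - predict(step.lm.fit, newdata = new_person)
```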

Is the relationship linear?

By applying linear regression, we are assuming that there is a linear relationship between the predictors and the outcome. If the underlying relationship is far from linear, then most of the inferences we have made so far are doubtful, and the accuracy of the model suffers.

The non-linearity of the model can be determined using residual plots. For multiple linear regression, we can plot the residuals versus fitted values. Presence of a pattern in the residual plots would imply a problem with the linear assumption of the model.

residualPlot(step.lm.fit, type = "pearson", id=TRUE)

The blue line is a smooth quadratic fit of the residuals against the fitted values. The curve is quite close to a straight line, indicating that the underlying data approximately follows linearity. (The numbers 1301 and 578 label the observations with the most extreme residuals; we'll get to those later.)

We can further plot the residuals against the individual predictors to see if any of them demonstrate non-linearity.

residualPlots(step.lm.fit)

##            Test stat Pr(>|Test stat|)    
## age           3.8860         0.000107 ***
## bmi          -2.2167         0.026811 *  
## children     -0.9573         0.338601    
## smoker                                   
## region                                   
## Tukey test   11.8537        < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The curvature tests, however, give significant statistics for age and bmi, as does the Tukey test, suggesting some mild non-linearity in these predictors even though the plots look roughly linear.

par(mfrow=c(2,2))
plot(step.lm.fit)

Correlation of error terms

An important assumption of the linear regression model is that consecutive error terms are uncorrelated. The standard errors of the estimated regression coefficients are calculated on this basis; if consecutive error terms are correlated, the true standard errors of the estimated coefficients may be much larger than reported.

We can check the auto-correlation of residuals using the Durbin-Watson test. The null hypothesis is that the residuals have no auto-correlation. The alternate hypothesis is that the residuals have a statistically significant correlation:

set.seed(1)
# Test for Autocorrelated Errors
durbinWatsonTest(step.lm.fit, max.lag = 5, reps=1000)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.04582739      2.088964   0.104
##    2     -0.03919652      2.075623   0.138
##    3     -0.02781912      2.052635   0.322
##    4      0.02815118      1.933913   0.268
##    5     -0.01764343      2.025369   0.582
##  Alternative hypothesis: rho[lag] != 0

Here we check the auto-correlation of residuals for 5 different lags. None of the lags has a p-value below 0.05, so we cannot reject the null hypothesis.
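The lag-1 Durbin-Watson statistic itself is simple to compute by hand, which clarifies what the test measures. A sketch (values near 2 indicate no autocorrelation; values near 0 or 4 indicate strong positive or negative autocorrelation):

```r
# Lag-1 Durbin-Watson statistic computed directly from the residuals.
e <- resid(step.lm.fit)
dw <- sum(diff(e)^2) / sum(e^2)  # diff(e) gives e_t - e_{t-1}
dw                               # should match the lag-1 D-W statistic above (~2.089)
```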

res <- step.lm.fit$residuals %>%
  tidy 
res$names <- as.numeric(res$names)
res%>%
  ggplot +
  geom_point(aes(x=names, y=x)) +
  labs(x='index', y='residuals')

Non-constant variance of error terms

Constant variance of residuals is another assumption of a linear regression model. The error variance may, for instance, change with the value of the response variable. One way of identifying non-constant variance of errors is the presence of a funnel shape in the residual plot. A more concrete way is an extension of the Breusch-Pagan test, available in R as ncvTest() in the car package. It assumes a null hypothesis of constant variance against the alternate hypothesis that the error variance changes with the level of the response or with a linear combination of predictors.
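The idea behind such a test can be sketched by hand: regress the squared residuals on the fitted values and check whether the slope is significant. This is only a rough analogue of ncvTest(), not a replacement for it:

```r
# Rough Breusch-Pagan-style check: do the squared residuals grow with the fit?
aux <- lm(resid(step.lm.fit)^2 ~ fitted(step.lm.fit))
summary(aux)$coefficients  # a significant slope hints at non-constant variance
```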

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(step.lm.fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 235.2819    Df = 1     p = 4.20249e-53
#lmtest::bptest(step.lm.fit)
# plot studentized residuals vs. fitted values 
spreadLevelPlot(step.lm.fit)
## Warning in spreadLevelPlot.lm(step.lm.fit): 
## 20 negative fitted values removed

## 
## Suggested power transformation:  0.2302333

A very low p-value (~4.2e-53) means the null hypothesis can be rejected: there is a high chance that the errors have non-constant variance. From the graph, we can also see how the spread of the absolute studentized residuals grows with the fitted values. One of the methods to fix this problem is transformation of the outcome variable.

bc <- boxcox(step.lm.fit, plotit = F)  # MASS::boxcox likelihood profile over candidate powers
powTrans <- bc$x[which.max(bc$y)]      # power (lambda) with the maximum likelihood
chargesBCTrans <- BoxCoxTrans(insurance$charges)  # caret's Box-Cox estimate for charges
chargesBCTrans
## Box-Cox Transformation
## 
## 1338 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770 
## 
## Largest/Smallest: 56.8 
## Sample Skewness: 1.51 
## 
## Estimated Lambda: 0 
## With fudge factor, Lambda = 0 will be used for transformations
insurance$chargesTrans <- predict(chargesBCTrans, insurance$charges)
head(insurance)
##   age    sex    bmi children smoker    region   charges chargesTrans
## 1  19 female 27.900        0    yes southwest 16884.924     9.734176
## 2  18   male 33.770        1     no southeast  1725.552     7.453302
## 3  28   male 33.000        3     no southeast  4449.462     8.400538
## 4  33   male 22.705        0     no northwest 21984.471     9.998092
## 5  32   male 28.880        0     no northwest  3866.855     8.260197
## 6  31 female 25.740        0     no southeast  3756.622     8.231275
trans.lm.fit <- update(step.lm.fit, charges^0.41~.)  # refit with a power-transformed response

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(trans.lm.fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.008299572    Df = 1     p = 0.9274116
# plot studentized residuals vs. fitted values 
spreadLevelPlot(trans.lm.fit)

## 
## Suggested power transformation:  0.3611456
plot(insurance$age, insurance$charges)

update(trans.lm.fit, ~age+I(log(bmi))+I(children^0.5)+smoker+region) %>%
  #ncvTest()
  #summary()
  residualPlots()

##                 Test stat Pr(>|Test stat|)    
## age                1.6556          0.09804 .  
## I(log(bmi))       -0.8427          0.39956    
## I(children^0.5)    0.9385          0.34814    
## smoker                                        
## region                                        
## Tukey test        -4.5980        4.266e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Extending the linear model

lm(charges~age+bmi+children+smoker, data = insurance) %>%
  summary()
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11897.9  -2920.8   -986.6   1392.2  29509.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12102.77     941.98 -12.848  < 2e-16 ***
## age            257.85      11.90  21.675  < 2e-16 ***
## bmi            321.85      27.38  11.756  < 2e-16 ***
## children       473.50     137.79   3.436 0.000608 ***
## smokeryes    23811.40     411.22  57.904  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.7489 
## F-statistic: 998.1 on 4 and 1333 DF,  p-value: < 2.2e-16
tidy(lm.fit)
##              term    estimate std.error   statistic      p.value
## 1     (Intercept) -11938.5386 987.81918 -12.0857530 5.579044e-32
## 2             age    256.8564  11.89885  21.5866552 7.783217e-89
## 3         sexmale   -131.3144 332.94544  -0.3944020 6.933475e-01
## 4             bmi    339.1935  28.59947  11.8601306 6.498194e-31
## 5        children    475.5005 137.80409   3.4505546 5.769682e-04
## 6       smokeryes  23848.5345 413.15335  57.7232020 0.000000e+00
## 7 regionnorthwest   -352.9639 476.27579  -0.7410914 4.587689e-01
## 8 regionsoutheast  -1035.0220 478.69221  -2.1621870 3.078174e-02
## 9 regionsouthwest   -960.0510 477.93302  -2.0087563 4.476493e-02
plot(lm.fit)

# Assessing Outliers
outlierTest(lm.fit) # Bonferroni p-value for most extreme obs
##      rstudent unadjusted p-value Bonferonni p
## 1301 5.009599         6.1887e-07   0.00082805
## 578  4.219800         2.6112e-05   0.03493800
qqPlot(lm.fit, main="QQ Plot") #qq plot for studentized resid 

## [1] 1231 1301
leveragePlots(lm.fit) # leverage plots

lm.fit <- lm(log(charges)~age+bmi+children+smoker, data = insurance)
summary(lm.fit)
## 
## Call:
## lm(formula = log(charges) ~ age + bmi + children + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11340 -0.19883 -0.04688  0.07197  2.07581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.9827766  0.0697227 100.151  < 2e-16 ***
## age         0.0347826  0.0008805  39.502  < 2e-16 ***
## bmi         0.0106096  0.0020264   5.236 1.91e-07 ***
## children    0.1011976  0.0101989   9.922  < 2e-16 ***
## smokeryes   1.5432438  0.0304372  50.703  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4491 on 1333 degrees of freedom
## Multiple R-squared:  0.7622, Adjusted R-squared:  0.7614 
## F-statistic:  1068 on 4 and 1333 DF,  p-value: < 2.2e-16
plot(lm.fit)

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(lm.fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 69.54001    Df = 1     p = 7.488006e-17
# plot studentized residuals vs. fitted values 
spreadLevelPlot(lm.fit)

## 
## Suggested power transformation:  1.253853
# Test for Autocorrelated Errors
durbinWatsonTest(lm.fit)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02799903      2.054062   0.274
##  Alternative hypothesis: rho != 0
# Assessing Outliers
outlierTest(lm.fit) # Bonferroni p-value for most extreme obs
##      rstudent unadjusted p-value Bonferonni p
## 517  4.664824         3.3996e-06    0.0045487
## 220  4.643317         3.7673e-06    0.0050407
## 431  4.611197         4.3883e-06    0.0058715
## 103  4.583994         4.9899e-06    0.0066765
## 1028 4.493947         7.5979e-06    0.0101660
## 527  4.304437         1.7966e-05    0.0240390
## 1020 4.273503         2.0611e-05    0.0275780
## 1040 4.231808         2.4769e-05    0.0331410
qqPlot(lm.fit, main="QQ Plot") #qq plot for studentized resid 

## [1] 220 517
leveragePlots(lm.fit) # leverage plots

vif(lm.fit)
##      age      bmi children   smoker 
## 1.014498 1.012194 1.001950 1.000745
# Evaluate Nonlinearity
# component + residual plot 
crPlots(lm.fit)

# Ceres plots 
ceresPlots(lm.fit)
## Warning in ceresPlots(lm.fit): Factors skipped in drawing CERES plots.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -0.025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 4.018e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -0.025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 4.018e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4

Sources:

  1. https://www.kaggle.com/mirichoi0218/insurance/home
  2. An Introduction to Statistical Learning, with Applications in R
  3. Wikipedia
  4. https://www.statmethods.net/stats/rdiagnostics.html
  5. https://www.statmethods.net/stats/regression.html
  6. https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/